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Abstract 



j^ . Supervised linear feature extraction can be achieved by fitting a reduced 

^ I rank multivariate model. This paper studies rank penalized and rank con- 

strained vector generalized linear models. From the perspective of thresh- 
^r^ ■ olding rules, we build a framework for fitting singular value penalized mod- 

K> i els and use it for feature extraction. Through solving the rank constraint 

form of the problem, we propose progressive feature space reduction for 
^ ■ fast computation in high dimensions with little performance loss. A novel 

cn \ projective cross-validation is proposed for parameter tuning in such non- 

C^^ I convex setups. Real data applications are given to show the power of the 

^ ■ methodology in supervised dimension reduction and feature extraction. 
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1 Background 

Recently, high dimensional data analysis attracts a great deal of interest from 
statisticians. The availability of a large pool of variables (relative to the sam- 
ple size) poses challenges in statistical modeling because in this high-dimensional 
setup, both estimation accuracy and model interpretability can be seriously hurt. 
Dimension reduction is a natural and effective means to reduce the number of 
unknowns. One can remove nuisance and/or redundant variables, referred to as 



variable/feature selection; alternatively, one can find low dimensional linear or 
nonlinear projections of the input data, referred to as feature extraction. In this 
paper, we focus on linear feature extraction for dimension reduction purposes. 

The most popular approach for linear feature extraction is perhaps the prin- 
ciple component analysis (PCA). Given X E M"^^ with n observations and p 
features, perform the Singular Value Decomposition (SVD) on the data X = 
UDV'^. Given any 1 < r < rank{X), denote by Vr the submatrix of V con- 
sisting of its first r columns. Then 

constructs r new features as linear combinations of the original features. The ex- 
traction is optimal in the sense that B = XVvr = XVrVj, with Vvr being the 
projection matrix onto the column space of Vr, gives the best rank-r approxima- 
tion to X: 

B = aig min ||JC — S|||n, 

B:rank{B)<r 

where || ■ ||i? is the Frobenius norm. A by-product is that the gram matrix of XV r 
is diagonal, which means all new features are uncorrelated with each other. 

On the other hand, PCA is unsupervised. In many statistical learning prob- 
lems, we want to construct new features that best predict the responses. Suppose 
Y E M"^™ is the response matrix, n being the sample size and m being the 
number of response variables. Supervised feature extraction can be given by the 
reduced rank regression (RRR) (Anderson 1951), with the RRR estimator B de- 
fined by 

B = arg min ||^ — ^-^IIf- 

B:rank{B)<r 

Assume X has full column rank and define H = X {X^ X)^^ X^ . Then B = 
BoisVrVj. , where Bois = {X X)X Y. Vr is formed by the first r columns 
of V from the spectral decomposition Y^HY = VDV^ . See Reinsel & Velu 
(1998) for more details. Therefore, 

Zr = X{BolsVr) 

constructs r new features that best approximate Y in Frobenius norm , and these 
new features are, again, uncorrelated. The RRR framework includes the PCA as 



a special case, by setting Y = X (Izenman 2008). The above RRR solves a 
nonconvex optimization problem in the classical setup n > p. Recently, Bunea, 
She & Wegkamp (201 1) studied the problem under p > n and developed finite- 
sample theories as well as a computational algorithm. 

On the other hand, the squared error loss may not always be appropriate. For 
vector generalized linear models (GLMs), such as discrete responses arising in 
classification problems, deviance loss is much more reasonable. 

Although there is a large body of literature on the RRR — Robinson (1974), 
Rao (1979), and Brillinger (1981), to name a few, to the best of our knowledge, 
there is very little work beyond the Gaussian model. Yee & Hastie (2003) studied 
the reduced-rank vector GLM problem and used an iterative approximate estima- 
tion by fitting RRR repeatedly. Yet this only provides an approximation solution 
to the original problem and there is not guarantee of converge. Heinen & Rengifo 
(2008) resorted to a continuation technique to deal with discrete responses. 

This paper aims to tackle the penalized and constrained vector GLMs 

min — log-likelihood(S; Y, X) H rank{B), and (1.1) 

B 2 

min — log-likelihood(-B; Y, X) s.t. rank(B) < r. (1.2) 

The imposed reduced rank structure is based on the belief that the features' rele- 
vant directions, in response to Y, define a lower dimensional subspace in W. The 
rank of such an estimator determines the number of new features to construct. 
These two problems are not equivalent to each other due to their nonconvexity. 
In fact, the rank function is nonconvex and discrete (and thus nondifferentiable), 
thereby posing a challenge in optimization. Our algorithms boil down to an itera- 
tive version, which is not surprising in the GLM setup. 

The rest of the paper is organized as follows. Section 2 starts by studying a 
matrix approximation problem, and then builds a framework for fitting singular- 
value penalized multivariate GLMs. Supervised feature extraction can be attained 
for non-Gaussian models not necessarily using the squared error loss. The frame- 
work covers a wide family of penalty functions. A new parameter tuning strategy 
is proposed. Section 3 tackles the rank constrained GLM problem and comes 
up with a feature space reduction technique. Through this, (11.11 ) and (11.21 ) can 
be combined to achieve better estimation accuracy and computational efficiency. 
Section 4 illustrates real applications of the proposed methodology. We conclude 
in Section 5. All technical details are left to the Appendix. 



2 Penalized Vector GLMs for Feature Extraction 

In this section, we study the penalized form reduced rank GLMs (11.11) . Our algo- 
rithm and analysis apply to p > n situations and cover a large family of singular- 
value penalties, including nuclear norm, Frobenius norm, Schatten p-penalties 
(0 < p < 1), and rank penalty. To achieve such generality, we start by studying a 
simpler matrix approximation problem. 

2.1 Singular- value penalized matrix approximation 

We consider the problem of matrix approximation with a singular value penalty 

mmi||r-S||| + 5^P(af);A) (2.1) 

i 

where cr| ' denote the singular values of B. The choice of the penalty function 
P is flexible. For example, P(t; A) = A|t| gives a multiple of the sum of singular 
values corresponding to the trace norm or nuclear norm penalty. When P(t; A) = 
A^lt^o/2, we get the rank penalty which is discrete and nonconvex. For a general 
P, the closed-form solution to (12.11 ) is not known (to the best of our knowledge). 
We address the problem from the standpoint of threshold functions. 

Definition 2.1 (Threshold function). A threshold function is a real valued function 
9(t; A) defined for — oo < t < oo and < A < oo such that 

1. e(-t;A) = -e(t;A), 

2. 0(t;A) <e(t';A)fort<t', 

3. limt-^oo 0(i; A) = oo, and 

4. < e(t;A) < tforO < t < cx). 

Remarks, (i) A vector version of (still denoted by 0) is defined componen- 
twise if either t or A is replaced by a vector, (ii) There may be some ambiguity 
in defining a threshold function. For example, the hard-thresholding can be de- 
fined as 0//(t; A) = tl|f|>A or Qnit; A) = tl|t|>A- Fortunately, commonly used 
thresholding rules have at most finitely many discontinuity points and such dis- 
continuities rarely occur in real data. When applying to a quantity (say t), we 
always make the implicit assumption that is continuous at t. (iii) By definition. 



6~^(m; A) = sup{t : 6(t; A) < m}, Vm > must be monotonically increasing on 
(0, oo) and bounded between the identity line and u = 0; its derivative is defined 
almost everywhere on (0, oo). We assume that 

dQ^^{u; A)/ du>l — Ce a.e. on (0, oo) 

for some constant Ce E [0,1] independent of A. (In fact, for all convex penalties 
constructed through (12.31 ). Cq can be set to be 0.) 
Next we introduce the matrix thresholding. 

Definition 2.2 (Matrix threshold function). Given any threshold function 6(-; A), 
its matrix version 6^^ is defined as follows 

e"(B; A) = C/diag{e(ap^; \)}V^ , WB e R"^*" (2.2) 

where U, V, and diagjaj^^ } are obtained from the SVD of B: B = C/^diagjap^ } V. 

Note that 6(0; A) = by definition, and 6'^(B; A) is not affected by the 
ambiguity of the SVD form. 

Proposition 2.1. Given an arbitrary thresholding rule 0, let P be any function 
satisfying 

r\e\ 
P{0;X)-P{O;X)= {sup{s:e{s;\k)<u}-u)du + q{9;X), (2.3) 

Jo 

where g(-;A) is nonnegative and g(9(t;A);A) = for all t G M. Then, the 
singular-value penalized minimization 

mm F{B) = ||r - B||^/2 + ^P(aP^ A) (2.4) 

has a unique optimal solution B = 6'^(1^; A) for every Y, provided 6(-; A) z^ 
continuous at any singular value of Y. 

See Appendix lAl for its proof. 

The function q is often just zero, but can be nonzero in certain cases. In fact, 
we can use a nontrivial q to attain the exact rank penalty; see ( 12.71 ). The propo- 
sition implies multiple (infinitely many, as a matter of fact) penalties can result 
in the same solution, which justifies our thresholding launching point (rather than 
a penalty one). Some examples of the penalty P and the coupled are listed in 
Tabled 

We point out two special cases of Proposition |2.1| as follows. 



Table 1 : Some basic singular- value penalties and their coupled thresholding functions. 

Nuclear norm Frobenius Rank Schatten-p, p G (0, 1) 

Penalty function A||B||, = A^(j|^^ A||B||| ^rank{B) A^K^^V 

Thresholding rule (t — sgn(t)A)l|(|>;^ j^ *l|t|>A Ex 2.7 in She (2012) 

(soft) (ridge) (hard) 



A fusion between nuclear norm and Frobenius norm Define a continuous 
thresholding rule 

{0, if|t|<A 

t-Asgn(t), ifA<|t|<A + M (2.5) 

-V, if|t|>A + M. 

When M — )• oo, 6^ becomes the soft-thresholding. When M = 0, Qb reduces 
to the ridge thresholding. The penalty constructed from (|2.3I) is given by 

p{e-\,M) = I [J' 

l^^W^> if|^|>M, 

which is exactly the 'Berhu' penalty (Owen 2007) whose composition reverses 
that of Ruber's robust loss function. The Berhu penalty on the singular values 
provides a convex fusion of the nuclear norm penalty and the Frobenius norm 
(squared) penalty in the problem of (12.41) . Unlike the elastic net (Zou & Hastie 
2005), this fusion is nonlinear and fully preserves the nondifferentiable behavior 
(around zero) of the nuclear norm. 

A fusion between rank and Frobenius norm A direct thresholding rule that 
fuses the hard-thresholding and the ridge-thresholding is the hard-ridge thresh- 
olding (She 2009) 



0, if|t|<A 

if Itl > A. 



eHR{t;X,v) = { \ ,M,^ , (2.6) 



l+r)- 

Setting g = in Proposition [2T1 we obtain one associated penalty 
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Interestingly, noticing that Qhr is discontinuous at A, we can choose 



(i+n)(x-\e\r 



ifO< 1^1 < A 



0, if 6* = Oor |6'| > A 

which leads to P{9) = ^r]9'^ + ^ j^l^^^o- Therefore, Qhr{-] A, r/) can deal with 
the following rank-Frobenius penalty in (12.41) 



1 1 A^ 

-V\\B\\1 + rank(S). (2.8) 

This penalty may be of interest in statistical learning tasks that have joint con- 
cerns of accuracy and parsimony: the rank portion enforces high rank deficiency, 
while the ridge (Frobenius) portion shrinks B to compensate for large noise and 
decorrelates the input variables in large-p applications. 

At the end of this subsection, we present a perturbation result which will be 
used to establish the main result in the next subsection. 

Proposition 2.2. Given Y e W"""", let Q{B) = \\Y - B\\l/2 + Y^ Pe{<^f^-, A), 
where Pq is the penalty obtained from (|2.3I) . Denote by B the minimizer ofQ{B). 
Then for any matrix A G M"^™ 

g(s + A)-g(s)>^l|Ai||, 

where Ci = 1 — C^ > 0. 

See Appendix |B] for its proof. 

2.2 Singular-value penalized vector GLMs 

In this subsection, we generalize the results obtained for matrix approximation to 
vector GLMs. 

Let Y = [?/i, ■ ■ ■ ^Vm] ^ IR"^^'" be the response matrix with m response 
variables and n samples for each. Assume ?/j ^ are independent and each follows a 
distribution in the natural exponential family /{yi^, 9i^k) = Gw{yi,k9i,k — K^i,k) + 
c(2/i,fc)), where 6*^,^ is the natural parameter. Let Li^k = log/(yi,fc, 6^^,^), L = 
'^kJ2i^i,k- The canonical link function g = {b')~^ is applied throughout the 
paper. Let the model matrix and the corresponding coefficient matrix be 

X = [xi, ■ • • , x^f = [xo, ii, ■ ■ ■ , Xp] G M"x(P+i) and 
B = [(3„ ■ ■ ■ , /3J = [^0, ^1, ■ ■ ■ , ~0f e M(P+i)x-, 
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respectively. If Xq = 1, /3q represents the intercept vector. For convenience, we 
use B° = [/3]^, • • ■ , /3p]^ to denote the submatrix of B obtained by deleting the 

first row /3o , and use X° to denote the submatrix of X obtained by deleting the 
first column Xq. Given any GLM with coefficients /3, we introduce 

/x(/3) 4 [g-\xJ(3)Ui andX(/3) 4 X^WX = X^ di^g {b" (xj (3)}l^ X 

to denote the mean vector and the information matrix at /3. For the m-response 
vector GLM, the mean matrix ^i(S) = [/ijjj]^^^ is defined as [/x(/3^), ■• • ,/^(/3^)]. 

Remarks, (i) Having Xq and /3q is necessary. For non-Gaussian GLMs, one 
cannot center the response variables because this may violate the distributional as- 
sumption, (ii) For clarity, the above setup does not include any dispersion parame- 
ter. But all discussions in this subsection can be trivially extended to the exponen- 
tial dispersion family f{yi^k; Oi^k, 0) = exp{(?/i,fc6'i,fc - 6(6'i,fc))/a(0) + c{yi^k, </>)} 
which covers the vector Gaussian regression. 

Our goal is to minimize (11.11) or more generally, the following objective func- 
tion 

m n p/\m 

F{B) 4 -5^^L,fc(/3,;a.„2/,fc) + J]P(af°);A) (2.10) 

fc = l J = l s=l 

for a large family of penalty functions (possibly nonconvex). The penalty is not 
imposed on /3q. 

We construct the following sequence of iterates for solving the problem: given 
B^^', perform the update 

'^oO'+i) _ ea(^oo-) ^ -^oTy _ X°^/x(S(^')); A), 

Theorem 2.1. Given an arbitrary thresholding rule O, let P{-) be any function 
satisfying (l23l) . Starting with any B^^^ G M^p+i)^™, run (12.111) to obtain a se- 
quence {B^^)}. Denote by A^ the set of {t^^^^ + (1 - t)(3^i^^^ : t G (0, 1), j = 
1, 2, ■ ■ ■ }, 1 < A; < i^, and define 

p= max sup ||X(|fc)||2. 

Suppose p < 2 — Cq. Then F{B^^') is decreasing and satisfies 

i7^(^(i))_i7^(sO-+i)) >C'||b(^')-B(^'+i)|||/2, j = 1,2,--- (2.12) 



where C = 2 — Cq — p. Any limit point B* of the sequence B^^', referred to as a 
Q" -estimator, is a solution to the following equation: 

- 0-(S° + X°^Y - X^^fiiB); A) 
(r-M(S)fio, 

under the assumption that is continuous at all singular values ofB°*+X° Y— 

The proof details are given in Appendix O 

Recall that Cq < 1. In implementation, we can scale the model matrix by 
X/ko for any ko > y/p regardless of 6, and then perform (12.111) . The 9°^- 
estimate, obtained on the scaled data, can be scaled back to give an estimate on 
the original X. The procedure has a theoretical guarantee of convergence and 
(12.121) yields a good stopping criterion based on the change in B^^\ Empiri- 
cally, we always observe B^^' has a unique limit point. Similar to She (2012), 
when it is possible to explicitly calculate the curvature parameter ^e^ say for 
SCAD or soft-thresholding, we recommend using the smallest possible value of 
kQ = a/p/(2 — Cq), which significantly speeds the convergence of the algorithm 
based on extensive experience. (For example, with a convex penalty we can set 
ko = a/p/2.) We give two typical situations to find an upper bound for p in theory. 

Example 2.1 (Penalized Vector Gaussian GLM). For Gaussian regression, we 
can ignore the intercept term (after centering both responses and predictors be- 
forehand), and the objective function (12.101) becomes 

pAm 

||r-XS|||/2 + ^P((Tf);A). (2.14) 

s=l 

(12.111) reduces to 

^ij+i) = e'^(^(i) + x^Y - X^XB, A). (2.15) 

Here, X = X^WX = X^X. According to the theorem, /cq can be chosen to be 
II X II 2 regardless of the thresholding rule and the penalty, where || ■ II2 denotes the 
spectral norm. 

In the special case of imposing a direct rank penalty, where X]s=T -^ (^s , A) = 
^rank{B), another computational procedure based on the classical RRR algo- 
rithm can be used. In fact, RRR studies a relevant but different problem, with 



no penalty but subject to a low rank constraint. But we can adapt the proce- 
dure to minimizing \\Y — XB\\p/2 + A^/2 ■ rank(S) as follows (cf. Bunea, 
She & Wegkamp (2011)). Suppose X^X is nonsingular and H is the hat ma- 
trix X{X'^X)-^X'^. (i) Apply spectral decomposition to Y^HY: Y^HY = 
VDW^ where D = diag{rfi, ■■■ ,dm} with rfi > ^2 > ■ ■ ■ > ^m > 0. (ii) 
Given any value of A, define r = max{i : rfj > A} and Vr = V[ , l:r], by taking 
the first r columns in V. (iii) Then the (globally) optimal solution is given by 

^(A) = {X^xy^X^YVvr = {X^X)-^X^YVrVj, 

where Vvr is the orthogonal projection onto the column space of Vr- We can 
show the 6'^-estimate defined by (12.151) reduces to the RRR estimate in this case, 
the proof details given in Appendix iDl 

Proposition 2.3. Suppose X g M"^^ (n > p) has full column rank and ||^||2 < 
1. Then the RRR estimate B{X) constructed above must satisfy the Q'^ -equation 
B = Q^iB + X Y ~ X XB] A) for matrix hard-thresholding 0^. 

Unlike RRR, our algorithm and convergent analysis do not require X to have 
full rank oxn> p. In comparison with the large-p RSC (Bunea, She & Wegkamp 
201 1), (12.111) applies to any (and covers all vector GLMs). 

Example 2.2 (Penalized Vector Logistic GLM). Assume a classification setup 
where yik are all binary. The singular-value penalized vector logistic regression 
minimizes 

m n pArn 

- E E (y^^'^^ff^k - log(l + exp{xjf3,))) + J2 Pi^f°^; A). (2.16) 

fc = l j=l s = l 

The first iteration step in (12.111) becomes 

B°U+i) = Q-^B°^J) + x°^Y - X°^ [l /(I + eM-^lPf))] ; A). 

L J nxm 

(2.17) 

In R, the matrix fJ,{B) can be simply constructed by 1/ (1+exp (-X%*%B) ) . 
Since W{l3) = disig{b" {xf (3)} = diag{7ri(l - vTi)} ^ I/A with vr^ = 1/(1 + 
exp{—xf(3)), a crude but general choice of the scaling constant is ko > ||X||2/2, 
again, regardless of and A. Yet in applying a convex penalty such as the nuclear 
norm penalty, we can use fco = 11^112/(2-^2) to speed the convergence. 
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Some related works There has been a surge of interest in nuclear norm penal- 
ization recently, in which case the penalty in (12.101) simplifies to a multiple of the 
sum of all singular values of B° or A||B°||*. This gives a convex optimization 
problem. In the statistics community, Yuan et al. (2007) seem to be the first to 
study the nuclear norm penalized least squares estimator. A popular equivalent 
formulation of the nuclear norm minimization in optimization is through semidef- 
inite programming (SDP) (Fazel 2002). See, e.g., Candes & Recht (2009), Candes 
& Tao (2010), and Ma et al. (2009) for some recent theoretical and computational 
achievements. 

Although the nuclear norm provides a convex relaxation to the rank penalty, 
this approximation works only under certain regularity conditions (e.g., Candes & 
Plan (2011)). Bunea, She & Wegkamp (2011) show that direct rank penalization 
achieves the same oracle rate in a much less restrictive manner. Yet in addition to 
the reduced rank regression studies (see Example 12. II) . there have been very few 
attempts to extend the rank penalization beyond the Gaussian framework. Two 
commonly cited works are Yee & Hastie (2003) and Heinen & Rengifo (2008). 
See Section [H for their limitations. In comparison with these works, our matrix 
thresholding algorithm has a theoretical guarantee of convergence, is simple to 
implement, and covers a wide family of penalty functions as well as loss functions. 

Finally, we point out a major difference between the thresholding-based itera- 
tive selection procedures (TISP) (She 2009) and the proposed algorithm which can 
be referred to as matrix-TISP . TISP aims for variable selection in a single-response 
model, while here we discuss singular value regularization in vector GLMs. The 
singular- value sparsity or low rankness, different than coefficient sparsity, offers a 
new type of parsimony that can be used for supervised feature extraction. It brings 
a true multivariate flavor into our analysis. 

Feature extraction In many high-dimensional problems, /ea?Mre extraction, by 
transforming the input variables and creating a reduced set of new features, is 
a useful technique for dimension-reduction. For example, PCA considers linear 
projections of correlated variables to construct new orthogonal features ordered 
by decreasing variances. For singular-value penalized models, once a low-rank 
estimate B is obtained, one can attain the same goal. Suppose the rank of B is r. 
A direct way is to apply the reduced form SVD on B, getting JB = UDV^ with 
D anr X r diagonal matrix. Next, construct a new model matrix 

Type-I: Z = XU {or XUD) (2.18) 
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which has only r new predictors. We refer to this as Type-I extraction. It can be 
used for parameter tuning later. 

On the other hand, it may be preferred to work on XJB in some situations. 
Perform the spectral decomposition B X^XB = VDV^ , where V is an m x r 
orthogonal matrix. It follows that XB = XBVV^. Therefore, for the new 
design matrix Z defined by 

Type-II (or Post-Decorrelation): Z = X{BV) G M"""", (2.19) 

each column (^-predictor) can be represented as a linear combination of the columns 
of X, and the r newly obtained 2 -predictors are uncorrelated with each other, i.e., 
Z'^ Z is diagonal. We refer to this as Type-II extraction or post-decorrelation. 
Type-I and Type-II are not equivalent in general (but coincide for the RRR esti- 
mate). The (linear) feature extraction is supervised and the corresponding dimen- 
sion reduction can be dramatic when r is much smaller than p. 

Initial point When the problem (12.101) is convex, we can further show (based 
on Theorem |2.1|) that any ©'^-estimate is a global minimum point. In this case, the 
choice of the initial estimate -B*^"^ is not essential and a pathwise algorithm with 
warm starts can be used in computing the solution path B{\) for a series of values 
of A. However, for nonconvex problems we do not have such global optimality 
given any initial point B^ . Although one can try multiple random starts, we found 
that empirically, simply setting B^^' to be the zero matrix leads to a solution with 
good statistical performance. Intuitively, this looks for a local optimum that is 
close to zero. Of course, other initialization choices are possible. 

Parameter tuning The challenge still comes from nonconvexity. Take the rank 
penalty as an example. The solution path B{\) is discontinuous, while the opti- 
mal penalty parameter A (as a surrogate for the Lagrange multiplier in convex 
programmings) is a function of both the data (X, Y) and the true coefficient 
B. Therefore, plain cross-validation with respect to A does not seem to be ap- 
propriate, as slightly perturbed data may result in serious regularization param- 
eter mismatches. We propose to cross-validate the range space of the low rank 
estimator (as a function of A) and refer to it as the projective cross-validation 
(PCV). In the following, we focus on the rank-Frobenius penalty (12.81) and the 
associated hard-ridge thresholding rule (12.61) to describe the idea. Let ^ be a 



9^R estimator obtained from (|2.11|) and write it as 
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~0l 
B° 



following our pre- 



vious notation (cf. (I2.9I )). For the penalized part B , denote its rank by r and 



its SVD by B = UDV^ with D e W''. Construct r new features (Type- 
I) Z° = X°U and set Z = [xq, Z°] = [zi, ■ ■ ■ , Znf. Let C° = DV^ and 

C = [k, cY = [ci, ■■■ ,c,^]e M('^+i)x-. 

Proposition 2.4. Under the condition on p given in Theorem \2.1\ for any 0^^- 
estimator B, C defined above is a Frobenius penalized estimator associated with 
new model matrix Z, i.e., 

m n 

C G arg min - ^^ ^1.^,^(0^; 2^,^/^^) + -||C°|||. (2.20) 

fc=l i=l 

See its proof in Appendix IB When t] > or Z has full column rank, the 
optimization problem (12.201 ) is strictly convex and so C is the unique optimal 
solution. 

The proposition implies that once U is extracted, we can simply use maximum 
likelihood estimation on the projected data to obtain the rank penalized estimator, 
or ridge penalized estimation to obtain the rank-Frobenius penalized estimator. 
There is no need to run the more expensive reduced rank fitting algorithms. 

We now state the K-fo\d PCV procedure for tuning the rank penalty parameter 
in(fLT]). 

1. Run Algorithm (12.111) on the whole dataset for a grid of values for A. The 
solution path is denoted hy B(\i), I = 1, ■ ■ ■ , L. 

2. Obtain L candidate models via (12.181) . each with a new model matrix Z{1) = 
XU{1), 1<1<L. 

3. Compute the cross-validation error for each model. Concretely, partition 
the sample index set into K (roughly) even subsets 7i, ■ ■ ■ Tk- Given Z(l), 
fit a vector GLM on the data without the subset indexed by Tk, and evaluate 
its validation error (measured by deviance) on the left-out subset. In all, K 
maximum likelihood estimates are obtained and their validation errors are 
summed up to yield the CV error of the candidate model Z(l). Repeat this 
for all / : 1 < / < L. 

4. Find the optimal model that minimizes the CV error. 

In the pursuit of a parsimonious model with very low rank, a BIC penalty term 
can be added to the CV error (She 2012). This is necessary in the large-p setup 
(Chen & Chen 2008). 
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PCV is much more efficient than CV because the more involved reduced rank 
fitting algorithm runs only once beforehand, rather than K times in the CV train- 
ings. The ML fitting in Step 3, justified by Proposition I2.4[ involves very few 
predictors. Another benefit of PCV is that the parameter mismatch issue is elimi- 
nated and all K trainings are regarding the same model and feature space. 

When there is an additional ridge parameter (cf. (12.81 )). the procedure still 
applies, but a two-dimensional grid for (A, i]) has to be used. Fortunately, accord- 
ing to our experience, the statistical performance is not very sensitive to small 
changes in the ridge parameter and we can choose a sparse grid for it. Step 3 now 
fits a series of /2-penalized GLMs. But again, this type of problems is smooth and 
convex; Newton-based algorithms are reasonably fast. Finally, we mention that 
PCV shares some similarities to the selective cross-validation (SCV) proposed for 
variable selection (She 2012). 

3 Rank Constrained Vector GLMs for Feature Space 
Reduction 



In this section, we study the reduced rank GLMs in constraint form (cf. (11.21) ). 
For any r > 1 and 77 > 0, the problem of interest is 

m n 

min -VVL,,fc(/3,;a3„|/,fc) + ^||S°||| s.t. rank{B°) < r, (3.1) 

k=l i=l 

The additional Frobenius norm penalty is to handle coUinearity. Again, neither 
the penalty nor the constraint is imposed on the first row of B. 

We introduce a quantile thresholding rule 6*(-; r, 77) as a variant of the hard- 
ridge thresholding. Given 1 <r <p and 77 > 0, 9* (a; r, r]) -.W ^ W is defined 
for any a E W' such that the r largest components of a (in absolute value) are 
shrunk by a factor of (1 + r^) and the remaining components are all set to be zero. 
In the case of ties, a random tie breaking rule is used. The matrix version of 0* 
is defined as 

Q*''{B- A) = C/diag{e#([a{^^]; r, r/)}V^, MB G M^^™ 

where U , V, and diag{cr| ' } are obtained from the SVD of B: B = C/^diag{(j| } V. 
Then, a simple procedure similar to (12.111 ) can be used to solve (13.11) : given 
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B''^\ perform the update 

Starting with any B^^^ G M(p+i)>^™, denote the sequence obtained from (|3.2I) by 
{S*^^-*}. Let F be the objective function (I3.1I ). Define p as in Theorem 12.11 

Theorem 3.1. If p < 1, F{B^^') is decreasing and satisfies 

F{B^^^) - F{B^^+^^) > (1 - p)\\B^^^ - B^^+^^\\l/2, 
andrank{B°'^^^) < r,Wj > 1. 

See Appendix |F] for its proof. The preliminary scaling of X /k^ for any k^ > 
y^ guarantees the convergence. Still, PCV can be used for parameter tuning and 
model selection although the obtained estimate may not be globally optimal due 
to nonconvexity. 

Rank penalty vs. rank constraint We have developed algorithms (|2.11l) and 
(Il2l) for solving (fTTl) (or (12.101) ) and (fOl) (or (I3l1) ). respectively. The obtained 
estimates are usually local optimizers of the corresponding objective functions. 
However, in non-Gaussian GLM setups, we found that the nonconvexity of either 
problem can be very strong. For instance, there may exist many local optima all 
having the same rank but spanning different subspaces in W^. In consideration 
of this, the penalized solution path ^(A) (0 < A < +oo) may provide more 
candidate models of certain rank (if existing) than the constrained solution path 
B{r) (r = 1, 2, ■ ■ ■ ,p A m), which is advantageous in the stage of parameter 
tuning. This phenomenon is often observed in datasets where p is comparable to 
or larger than n. Note that typically the direct rank penalized -B(A) has no rank 
monotonicity. 

On the other hand, computing the solution path for the penalty form is often 
more time-consuming in large-p applications. The path -B(A) has jumps. As- 
suming no prior knowledge of the appropriate interval for A, one has to specify a 
large search grid fine enough to cover a reasonable number of candidate models. 
By contrast, for the problem of constraint form, we can set a small upper bound 
for r in pursuing a low rank model (say, r < O.Sn A p Am could be good), and 
the natural grid spacing is 1. With the grid focusing on small values of r (which 
amounts to applying large thresholds in the iteration steps). Algorithm (13.21) runs 
efficiently. 
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Feature space reduction To combine the virtues of both approaches, we pro- 
pose to solve the rank constrained problem to perform feature space reduction, 
and then run the rank penalized algorithm in the reduced feature space. This is 
very helpful in large-p applications. A crude sketch is as follows. First, we set 
r = an A p A m with a < 1 (e.g., a = 0.5) and solve (|1.2I) . Using the esti- 
mate B{r), we execute Type-I feature extraction (12.181 ) to construct a new model 
matrix Z = XUi{r) with only r factors (in addition to the intercept). Next, we 
turn to the penalized problem (12.101) on (Y, Z): Get the solution path from run- 
ning Algorithm (|2.11l) . and tune the regularization parameters to find the optimal 
estimate, denoted by B (Ao). Our final coefficient matrix estimate is given by 

Ui{r)B (Ao). A small number of new predictive features can be constructed (and 
decorrelated) based on (12.191) . 

According to this scheme, the sample size of the reduced problem on Z is 
large relative to the reduced dimension. It is not difficult to show that for n > p, 
the update in (12.111 ) is essentially a contraction, and so Algorithm (12.111 ) converges 
fast. 

A crucial assumption here is that the rank of the true model, denoted by r*, 
is very small, compared with the sample size n. This makes it possible to choose 
a safe rank constraint value r in (11.21 ). which, though possibly much less than p, 
is still much larger than the true r*. Hence the computational cost of obtaining a 
solution path according to Algorithm (12.111 ) can be effectively reduced with little 
performance loss. This idea shares similarity with the variable screening (Fan & 
Lv 2008) proposed in the context of sparse variable selection. In the process of 
screening, all relevant variables should be kept, while in feature space reduction, 
only the necessary factors, being linear combinations of the original predictors 
and typically as few as a handful, are required to lie in reduced feature space we 
project X onto. 

In implementation, we further adopt a path-following (annealing) idea to re- 
duce computational load and avoid greedy reduction. Define a cooling schedule 
^it) (0 < t < T) with r(0) = p and r(T) = r, where r is an upper bound of 
the target rank. We conduct progressive feature space reduction as follows. (As 
aforementioned, Z° refers to Z without the first column, B° refers to B without 
the first row, and U° refers to U without the first row and the first column.) 

1. Lett^O, Z^X, t/^/. 

2. Iterate until r(t) < r: 
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(a) Set the rank constraint value to be r{t) and perform the update (13.21) 
on (Y, Z) for at most M times (with M pre- specified); 

(b) Obtain the left singular vectors of the current slope estimate B°, de- 
noted by C/i(r(t)); 

(c) Let Z° ^ Z°Ui{r{t)), U° ^ U°Ui{r{t))\ 

(d) t^t + l. 

At the end, Z is delivered as the new design, and the orthogonal matrix U gives 
the accumulated transformation matrix. 

The previously described prototype reduction scheme corresponds to r(t) = r 
for any t. With an annealing algorithm design, the dimensionality of the fea- 
ture space keeps dropping; the B involved in (13.21) has only r(t) columns. A slow 
cooling schedule with a small number of M is recommended. Based on our exper- 
iments, it is not too greedy and is usually computationally affordable for large-p 
problems. 

4 Data Examples 

We use two real data examples to illustrate the proposed methodology for dimen- 
sion reduction and supervised feature extraction. 

Example 1. First, we make a practical comparison of the rank penalized esti- 
mators from solving (11.11 ) and the rank constrained estimators from solving (11.21) 
by use of a zipcode dataset. The whole dataset (available at the website of Hastie 
et al. (2001)) contains normalized handwritten digits in 16 x 16 grayscale images. 
The digits were originally scanned from envelopes by the U.S. Postal Service and 
have been deslanted and size normalized. The space of pixel predictors is of di- 
mension 256. We standardized all such predictors. The intercept term is included 
in the model and is always unpenalized. We introduced m = 9 indicator response 
variables for digits 0-8, using 9 as the reference class. 

The training set is large in comparison with p and m (7291 images). We chose 
a subset of n = 300 at random in this experiment to compare the penalized solu- 
tion path and the constrained solution path. No additional Frobenius-norm penalty 
was enforced. The prediction results of the estimates are shown in Tabled evalu- 
ated on 2007 test observations. 
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Table 2: Rank constraint vs. rank penalty. Misclassification rates of the constrained 
and penalized reduced rank logistic regressions (RRL^^) and RRL^^)) are shown for the 
zipcode (sub)dataset where p = 257, n = 300. The rank r controls the # of newly 
constructed features. 



r 


1 


2 


3 


4 


RRL^^) 

RRL(p) 


66.52% 
66.52% 


55.06% 
55.06% 


38.47% 
38.32%, 38.37% 


33.83% 
33.83% 


RRL^'^^+SVM 

rrl(p)+svm 


59.24% 
59.24% 


47.48% 
47.48% 


33.58% 
33.63%, 33.58% 


30.79% 
30.79% 


5 


6 


7 


8 


9 


24.86% 
24.81%, 24.86% 


22.27% 

22.42% 


21.33% 
21.47%, 21.33% 


21.33% 
21.33% 


20.43% 
20.43%, 19.13% 


23.02% 
23.02% 


20.38% 
21.08% 


20.43% 
20.33%, 20.43% 


20.28% 
20.33%, 20.28% 


18.53% 
18.53%, 15.84% 



From the table, at certain values of r, the rank penalty offered more candidate 
models along its solution path than the rank constraint. Note that these rank-r 
estimators may behave differently in prediction and feature extraction. For p ~ n 
or p > n, this phenomenon is commonly seen. With an appropriate parameter 
tuning strategy, the penalty form gives better chances to achieve a low error rate. 

Of course, this comes with a price in computation. In our experiment, the 
time for obtaining the RRL^'^) path was less than one minute, while computing the 
RRL*^p) path, with a 50-point grid for A, took about four minutes. 

There is no obligation to predict through the obtained estimator; perhaps more 
useful is the much lower dimensional feature space yielded from such an estima- 
tor. Fancier classifiers such as SVM can be applied with the new features auto- 
matically extracted and decorrelated via (12.191) , and result in lower error rates as 
shown in the table. 

Finally, we add a comment that in some situations there may exist no penalized 
solution at certain rank values. Yet with a large A-grid chosen, the performance of 
the penalized estimator (after parameter tuning) does not seem to be worse than 
that of the constrained estimator. 

Example 2. The Computer Audition Lab 500 (CAL5 0) dataset is collected by 
TumbuU et al. (2008) and involves 502 Western popular songs by different artists 
selected from the past 50 years. Digital audio files were played to students to an- 
notate these songs with m = 174 words representing emotion, genre, instrument. 
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vocals, etc. The concepts characterized by the words are not mutually exclusive 
and one song can be annotated with multiple labels. This is called multi-label 
data in machine learning. The predictors are MFCC-Delta audio features from 
analyzing a short-time segment of the audio signal. TumbuU et al. (2008) used 
68 such feature vectors. To allow for interactions between these audio features 
and to make a more challenging problem, we consider a full quadratic model 
including all main effects, quadratic effects, and pairwise interactions. Hence 
p = 68 + 68(69)/2 + 1 = 2415. We split the data into two halves and used 
n = 251 songs for training and the other 251 for testing. 

For this small- sample- size-high-dimensional problem, the SVM using all 2415 
predictors gave a total misclassification rate of 21.2%, which is not all bad. On 
the other hand, the proposed reduced rank methodology can be applied to auto- 
matically construct new predictive audio features, possibly much fewer than 2415. 
The supervised nature is important because only the audio features helpful in an- 
notation (classification) are truly meaningful in this learning task. 

First, we conducted the progressive feature space reduction introduced in Sec- 
tion [3l with the upper bound of the target rank set to be 20. Then we ran Algo- 
rithm (12.111 ) to fit a penalized reduced rank vector logistic regression with the 
20 extracted features. The rank-Frobenius penalty was chosen due to serious 
coUinearity arising from the high-dimensional quadratic model. The parameters 
were tuned by 5-fold PCV with BIC correction. 

Surprisingly, our final estimate B has rank{B°) = 2, which gives a dramatic 
dimension reduction from 2514 to 2. But the SVM trained based on just the two 
new features yielded an improved error rate of 14.13%. In fact, even using the 
vanilla reduced rank estimator, we can achieve an error rate of 14.36%. The per- 
word precision and recall (cf. TumbuU et al. (2008) for the detailed definitions) 
are, respectively, 35.6% and 8.7% on the test dataset, comparable to the rates of 
the two advocated approaches in TurnbuU et al. (2008). But our model is more 
parsimonious and creates two concise audio summary indexes for semantic anno- 
tation. 



5 Conclusion 

Supervised linear feature extraction can be obtained from a reduced rank vector 
model. We studied rank penalized and rank constrained generalized linear models 
and discussed how to adapt them to feature extraction and feature space reduction. 
The latter technique helps to reduce the computational cost significantly in high 
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dimensions. We also noticed the strong nonconvexity of such problems raises 
some serious issues in data-resampling based parameter tunings, but the proposed 
projective cross-validation works decently in general and is efficient. Through 
reduced rank modeling, dimension reduction can be attained if the rank of the 
model is small relative to the number of predictors. The work can be viewed as a 
supervised and parametric generalization of the principle component analysis. 



A Proof of Proposition \2A 



To prove Proposition [2111 we first introduce two lemmas. 

Lemma A.l (von Neumann (1937)). Let A, B be two n x n matrices. Then 

\Tr{AB)\<Y,<Ay^{B), (A.l) 

where cri(A) > cr2(A) > ■ ■ ■ > cr„(A) and ai{B) > a2{B) > ■ ■ ■ > crn{B) are 
ordered singular values of A and B respectively. 

We refer to Grigorieff (1991) for an elementary proof. 

Lemma A.2. Given a thresholding rule 6, let P be any penalty satisfying condi- 
tion (|2.3I) in Proposition \2.1\ Then, the univariate minimization problem v[vva.Q{t — 
OY /2 + P{0] A) has a unique optimal solution 9 = 0(t; A) for every t at which 
6(-; A) is continuous. 

Proof Apply Lemma 1 in She (2012). D 

Proof of the optimality part of Proposition \2.1\ Let Y E M"^™ and assume n > m 

rp rp 

without any loss of generality. Let Y = UqDqVq and B = UDV be the 
SVDs where Dq = diag((io,i) and D = diag(di) with (io,i > (io,2 > ■ ■ ■ > c^cm 
and di > d2 > ■ ■ ■ > dm- Clearly, 

\\Y-B\\l = \\Y\\l+\\B\\l-2Tr{Y^B) 

= \\Y\\l+\\B\\l-2Tri[Y Of[B 0]), 

where [B 0] G M"''" and [Y 0] G M"''". It follows from Lemma |Aj] that 
TriY^B) < Y^da^idi- Hence 

F{B) > {\\D4l + \\D\\l - 2Tr{DoD))/2 + J]P(d,; A) (A.2) 

= J](4,-t/,)V2 + 5^P(rf,;A). 
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Now the problem reduces to 

mill J2ido,i - diY/2 + J2 Pidi; A). 

The optimal solution B then follows from Lemma lAT2l D 

The argument above only implies the singular values of B are unique (up to 
permutation). Although one can possibly argue the uniqueness of B by studying 
the condition under which equality is achieved in (IA.2I) . another formal proof of 
the uniqueness is deferred to Appendix |Bl 



B Proof of Proposition |23 



LetB = B + A. Suppose Y = UoDqV^, B = UqDV^, and B = UDV^ 
are the SVDs. We have 

\\Y-B\\l/2-\\Do-D\\l/2 
= -Tr{B^Y)+Tr{DoD) 

= -Tr{B^{Y - B)) + Tr{{Do - D)D) + Tr{DD) - Tr{B'^B) 
= -Tr{VDU^Uo{Do - D)V^) + Tr{{Do - b)D) + Tr{Db) - Tr{B^B) 

By Proposition |2T|£) ^ Dq, i.e., Dq — ID is positive semi-definite. By augment- 
ing Y — B and B and applying Lemma lATTl we can prove 

Tr{D{Do -b)> Tr{VDU^Uo{Do - b)V^), 

from which it follows that 

\\Y -B\\l/2-\\Do- D\\l/2 > Tr{Db)-Tr{B'^B) 

> Ci{Tr{Db) -Tr{B^B)). 
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Now we have 

Q{B)-Q{B) = ||r-S|||/2-||r-B|||/2 + 5^Pe(f/.;A)-5^Pe(ci"i;A) 

> \\Do- D\\l/2 -\\Do- b\\l/2 + J2 Mdu A) - J^ Pe{k A) 

+CiTr{DD - B^B) 
= Yl ((^0.* - ^*)V2 + Peid^; A)) - ((cio,i - lY/2 + Pe(c^.; A)^ 

+CiTr{Db - B^B) 

= Ci{\\D- b\\l/2 + Tr{Db)-Tr{B'^B)) 
= C,i\\D\\l/2+\\b\\l/2-TriB''B)) 
= Ci\\B - B\\l/2. 

The second inequality is due to Lemma 2 in She (2012). D 

Proof of the optimality part of Proposition 12. 1 1 From the comment in Appendix 
lAl any optimal solution B must have the same nonzero singular values (up to per- 
mutation) as B, i.e., rfj = dj, seen from the proof of Proposition 12. 1[ A more 
careful examination of the proof of Proposition 12.21 shows Q{B) — Q{B) > 
Tr{Db - B^B) = \\B - B\\l/2. Therefore, the globally optimal solution 
B in Proposition [2T| must be unique. D 



C Proof of Theorem 2.1 



Theproofis similar to that of Theorem 2.1 inShe(2012). Define a surrogate func- 
tion G for any A = [ai, •■■,«„] = [ao, ai, • ■ ■ , o^pf and B = [f3^,- ■■ , f^m] ^ 



"•pj 

3(p+l)xm 

pAm 



III, I ii }~" \i I ii ^ 

GiB,A) = -5^5^L,,K) + 5^P(ar);A) + -l|A-Bl|| 

fc = l j=l s=l 

m n m n 

- J2 JL^K^^"^^^ - H^^/^fc)) + J2Y1 f^^AxJ^k - xj/3„), 



fc=l J=l k=l i=l 
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where fii^k = 9 ^i^JPk) = ^'i^If^k)- It can be shown that given B, minimizing 
G over A is equivalent to 



pAm 

argmin^ -\\A-[B + X^'Y - XV(S)] ||^ + E ^(^^^"^^ A). 

s=l 

By Proposition l2.1l B^^'^^' in (12.111) can be characterized by arg min^ G{B^^' , A) . 
Furthermore, we have for any A G ]R(p+i)^"^ 

G{B^^\ B^^+^^ + A) - G{B^^\ B^^+^^) > — ||A||^ (C.l) 

with Ci = max(0, 1 — Cq), by applying Proposition 12.21 and Lemma 1 in She 
(2012), and noting that g,(a(-^°''^")) = 0, for B°(-'+^^ obtained by 0'^-thresholding. 
Next, Taylor series expansion gives 

k 

= G{B^^\ B(^-+i)) < G{B^^\B^^^) - Y, y('^1'^'^ - f^k^fif^k^'^ - /3?) 

k 

(12.121) can be obtained. In fact, this decreasing property holds for any p < 2 — Cq. 
Let B^^'-^ — )• B* as / — )■ oo. Under the condition p < 2 — Cq, C is strictly 

positive and \\B^^^+^^ - B^^^^\\j,/2 < {F{B^^^^) - F{B^^'+^^))/G < {F{B^^'^) - 
ir(_B(ife+i)))/^ ^ O.Thatis, 0'^(S°(^'')+X°^r-X°V(-B^^''^); \)-B°^^'^ -^ 0. 
Therefore, B* is a solution to (12.131 ) due to the continuity assumption. D 



D Proof of Proposition 123 



Let M = {X'^Xy^/'^X'^Y and ro = rank(M). Obviously, ro < p A m. 
Note that M^M = Y^HY. Assume M = UDV^ is the SVD of M with 
U e W^'^ V e M™^^", and D e M^«^'^«. Suppose all (positive) diagonal 
entries of D are arranged in decreasing order. Let A = X^Y — X^XB. To 
prove B obeys the 6'^-equation (12.131) for hard-thresholding, it suffices to show 
that (i) there exists a p x tq orthogonal matrix [/* satisfying U'^U^, = I such 

that Ul{BB )[/* and Ul{AA^)U^ are both diagonal; (ii) there exists an m x 
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ro orthogonal matrix V^ such that Vj{B B)V^ and F^ (A^A)V^, are both 

diagonal; (iii) Tr{B A) = 0; (iv) the singular values of A are all bounded by A. 
Recall that r = max{i : di > \} and Vr = V[ ,l:r]. Introduce V^r = 
V[ , (r+l):ro], formed by deleting the first r columns in V. Then we have 



A = X'Y -X' XB = X'Y -X' HYVv, 

= X^Yil - Vv,.) = X^YVv.,. 



X'YV-rVi 



[X'^Xfl'^MV-rV^_, 



(D.l) 



Obviously, Tr{B A) = 0. (iii) is true. On the other hand, we can rewrite B as 



B = {x^xy^/'^MVrVj = {x^xy^/'^Mv 







(ro-r)x(ro-r) 



V' 



iX^Xr^'^UD 



Now we obtain 



J-rxr 







(ro-r')x(ro-r) 



V^ = (X^X)-i/2e?^(M; A). 

(D.2) 



B B = V 



A^A = V 







0, 



(ro-r)x(ro— r) 



■ (ro— r)x(ro— r) 



DU^iX^XY^UD 







v^ 



(ro— r)x(ro-r) 

(D.3) 



DU^(X^X)UD 



0, 



F^. 



(iv) is straightforward from (ID.4I) : 

0.> 



lA^Allo < 11X11^ 



t/iD 



■ (ro-r)x(ro-r) 



V' 



■ (ro-r)x(ro-r) 

(D.4) 



< 1 ■ ci^+i < Al 



(|D3]) + (lD4l) also implies (ii). In fact, introducing G = DU^{XXy^UD, 
H = DU^{XX)UD, Gu = G[l:r,l:r], H22 = if [(r+l):ro, (r+l):ro], and 
assuming the spectral decompositions of the two submatrices are given by Gu = 
U^^Df^{Uf^f and if 22 = U^^D^^iU^^Y , respectively, then. 



G 



V, 



•-^11 



'-^22 
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simultaneously diagonalizes B B and A^A and satisfies V^ V* = /. 
Finally, we construct C/^, to prove (i). From (ID .21) and (ID.ll) . 



AV. = {X'^Xf/^UD 







0. 



(ro-r-)x(ro-r) 



■ (ro-r)x(T-o-r) 



[/ 



t/ 



H 

22 



G 



•-^ 11 



u 



H 

22 



(D.5) 
(D.6) 



Let G = (X^X)-V2[/£, and H = [X'^Xfl'^UD. Then cf G = G, if H = 
H. By construction, C/f^ and C/^ must be the right- singular vectors of Gi = 

G[ , l:r] and i3"2 = -H"[ ,('"+l):'"o] respectively. Denoting by C/f and t/^ their 
associated left-singular vectors respectively, we get 



U. 



G ttH 



ux u 



-T, 



which makes both U ^ BB U^ and U ^ AA U ^ diagonal. To prove U^ is the 
desired matrix in (i), it remains to show the orthogonality of C/*. It follows from 
(ID3]) and (ID31) that 



(G,t/f,r^2c/i = (c/sr (x^x)-v2c/ip 



- rxr 





(X^Xy/^UD 







■ (ro-r)x(ro-r) 



L/ 



H 
22 



c/fi)^ [ /,x. ] Du^ix^xr'/\x^xy/' 





C/i? 



■ (ro-r)x(ro-r) 



t/ 



(C/fi)^ [ /,xr ] DU^UD 





-* {r'o~r)x(ro— r) 



•-^22 



(C/fl)^ [ /,xr ] £)' 







■ (ro-r)x(ro-r) 



c/ 



0. 



Since Gu and i/22 are positive definite (noting that D E 



nroxro : 



IS nonsingu- 



lar), we further obtain (Ui) U2 = 0. Hence U^ t/* = /. The proof is now 
complete. D 
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E Proof of Proposition 1X4 

From Theorem 12 .11 B satisfies 




-fx{B;X)fxo. 



T~ (E.l) 



Here, we write n{B; X) to emphasize the dependence of the mean matrix on the 
design. In this proof, we use the same submatrix notation as in Appendix iDl 

Given the SVD B° = UDV^, by Definition 12.21 there exist orthogonal ma- 
trices U and V, as augmented versions of U and V, respectively, i.e., U = 
C7[/,] and V = V[I,] for some index set /, such that B° = LfEV^ and 
X°^Y - X°^n{B; X) = UWV^ are both the SVDs. Clearly, S[/, /] = D, 
I][J'^, I^] = 0. Using the hard-ridge thresholding (12.61) . we rewrite the first equa- 
tion in (IE. II ) as 

(1 + r])B° + A(l + r])USV^ = B° + X°^{Y - /x(B; X)), (E.2) 

where S is diagonal and satisfies S[I,I] = and S[i,i] < 1 for any i E P. 
Left-multiplying both sides of (IE. 21) by U^ yields 

r]DV^ = U^X°^{Y - n{B, X)). 

On the other hand, from the construction of C and Z, it is easy to verify xjB = 
zJC, from which it follows that /u(S; X) = n{C; Z). Therefore, C satisfies 



^C° =Z°^{Y-iJi{C-Z)) 
={Y- fi{C; Z)f~Zo. 



T. (E.3) 



Noticing that the optimization problem in (12.201 ) is convex and (IE.3I ) gives its KKT 
equation, the conclusion follows. D 



F Proof of Theorem 3. 1 



Lemma El. Given any Y G M"^, B = Q*'^{Y;r,ri) is a globally optimal 
solution to 

mm -\\Y - B\\l + -\\B\\l s.t. rank{B) < r (Rl) 

B 2 2 
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Proof. The problem is equivalent to minimizing 

subject to rank{B) < r. Applying Lemma IATI yields the result. D 

The remainder of the proof follows the same lines to the proof of Theorem 
12. 1[ See Appendix O for details. 
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